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Abstract: Text data, including speeches, stories, and other document forms, are often connected to 
sentiment variables that are of interest for research in marketing, economics, and elsewhere. It is also 
very high dimensional and difficult to incorporate into statistical analyses. This article introduces a 
straightforward framework of sentiment-sufficient dimension reduction for text data. Multinomial in- 
verse regression is introduced as a general tool for simplifying predictor sets that can be represented 
as draws from a multinomial distribution, and we show that logistic regression of phrase counts onto 
document annotations can be used to obtain low dimension document representations that are rich in 
sentiment information. To facilitate this modeling, a novel estimation technique is developed for multi- 
nomial logistic regression with very high-dimension response. In particular, independent Laplace priors 
with unknown variance are assigned to each regression coefficient, and we detail an efficient routine for 
maximization of the joint posterior over coefficients and their prior scale. This 'gamma-lasso' scheme 
yields stable and effective estimation for general high-dimension logistic regression, and we argue that 
it will be superior to current methods in many settings. Guidelines for prior specification are provided, 
algorithm convergence is detailed, and estimator properties are outlined from the perspective of the 
hterature on non-concave likelihood penalization. Related work on sentiment analysis from statistics, 
econometrics, and machine learning is surveyed and connected. Finally, the methods are appUed in two 
detailed examples and we provide out-of-sample prediction studies to illustrate their effectiveness. 
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1 Introduction 



This article investigates the relationship between text data - product reviews, political speech, 
financial news, or a personal blog post - and variables that are believed to influence its com- 
position - product quality ratings, political affiliation, stock price, or mood polarity. Such 
language-motivating observable variables, generically termed sentiment in the context of this 
article, are often the main object of interest for text mining applications. When, as is typical, 
large amounts of text are available but only a small subset of documents are annotated with 
known sentiment, this relationship yields the powerful potential for text to act as a stand-in for 
related quantities of primary interest. On the other hand, language data dimension (i.e., vocab- 
ulary size) is both very large and tends to increase with the amount of observed text, making the 
data difficult to incorporate into statistical analyses. Our goal is to introduce a straightforward 
framework of sentiment-preserving dimension reduction for text data. 

As detailed in Section [2|1, a common statistical treatment of text views each document as 
an exchangeable collection of phrase tokens. In machine learning, these tokens are usually just 
words (e.g., tax, pizza) obtained after stemming for related roots (e.g., taxation, taxing, and 
taxes all become tax), but richer tokenizations are also possible: for example, we find it useful 
to track common n-gram word combinations (e.g. bigrams pay tax or cheese pizza and trigrams 
such as too much tax). Under a given tokenization each document is represented as = 
[xii, . . . , Xjp]', a sparse vector of counts for each of p tokens in the vocabulary. These token 
counts, and the associated frequencies fj = yii/nii where nii = J2^=i ^ij^ '^hen the basic data 
units for statistical text analysis. In particular, the multinomial distribution for Xj implied by an 
assumption of token-exchangeability can serve as the basis for efficient dimension reduction. 

Consider n documents that are each annotated with a single sentiment variable, yi (e.g., 
restaurant reviews accompanied by a one to five star rating). A naive approach to text- sentiment 
prediction would be to fit a generic regression for yi|xj. However, given the very high dimen- 
sion of text-counts (with p in the thousands or tens of thousands), one cannot efficiently estimate 
this conditional distribution without also taking steps to simplify Xj. We propose an inverse re- 
gression (IR) approach, wherein the inverse conditional distribution for text given sentiment is 
used to obtain low dimensional document scores that preserve information relevant to y^. 

As an introductory example, consider the text- sentiment contingency table built by col- 
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lapsing token counts as = J2ryi=y ^^'^h y E y, the support of an ordered discrete 

sentiment variable. A basic multinomial inverse regression (MNIR) model is then 

Xy ~ MN(qy, my) with qyj = ^^pK + for j = 1, . . . yey (1) 

where each MN is a n-dimensional multinomial distribution with size m,, = V m, and 
probabilities = [g^i, . . . , g^p]' that are a linear function of y through a logistic link. Under 
conditions detailed in Section [3| the sufficient reduction (SR) score for fj = ^ilmi is then 

Zi = ip'ii ^ yi AL ^i,mi \ Zi. (2) 



Hence, given this SR projection, full x, is ignored and modeling the text-sentiment relation- 
ship becomes a univariate regression problem. This article's examples include linear, E[7/j] = 
Po+PiZi, quadratic, E[?/j] = I5q+ l5iZi+ I52zf , and logistic, p(yi < a) = (1 + exp[/3o + PiZi])~^ , 
forms for this /orwarJ regression, and SR scores should be straightforward to incorporate into 
alternative regression models or structural equation systems. The procedure rests upon assump- 
tions that allow for summary tables wherein the text- sentiment relationship of interest can be 
modeled as a logistic multinomial, but when such assumptions are plausible, as we find com- 
mon in text analysis, they introduce information that should yield significant efficiency gains. 

In estimating models of the type in ([T]), which involve many thousands of parameters, we 
propose use of fat-tailed and sparsity-inducing independent Laplace priors for each coefficient 
To account for uncertainty about the appropriate level of variable- specific regularization, 
each Laplace rate parameter \j is left unknown with a gamma hyperprior. Thus, for example, 

= ye-^^I^^I^Af ^e-^^^ .,r,A, > 0, (3) 

independent for each j under a Ga(s, r) hyperprior specification. This departure from the usual 
shared- A model is motivated in Section [3l3. 

Fitting MNIR models is tough for reasons beyond the usual difficulties of high dimension 
regression - simply evaluating the large-response likelihood is expensive due to the normal- 
ization in calculating each qj. As surveyed in Section]?} available cross-validation (e.g., via 
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solution paths) and fully Bayesian (i.e., through Monte-Carlo marginalization) methods for es- 
timating ipj under unknown \j are prohibitively expensive. A novel algorithm is proposed for 
finding the joint posterior maximum (MAP) estimate of both coefficients and their prior scale. 
The problem is reduced to log likelihood maximization for if with a non-concave penalty, and 
it can be solved relatively quickly through coordinate descent. For example, given the prior in 
Q, the log likelihood implied by ([!]) is maximized subject to (i.e., minus) cost constraints 

c(y^,-) = slog(l + |^,|/r) (4) 

for each coefficient. This provides a powerful new estimation framework, which we term the 
gamma-lasso. The approach is very computationally efficient, yielding robust SR scores in 
less than a second for documents with thousands of unique tokens. Indeed, although a full 
comparison is beyond the scope of this paper, we find that the proposed algorithm can also be 
far superior to current techniques for high-dimensional logistic regression in the more common 
large-predictor (rather than large-response) setting. 

This article thus includes two main methodological contributions. First, Section |3] intro- 
duces multinomial inverse regression as an IR procedure for predictor sets that can be rep- 
resented as draws from a multinomial, and details its application to text-sentiment analysis. 
This includes full model specification and general sufficiency results, guidelines on how text 
data should be handled to satisfy the MNIR model assumptions, and our independent gamma- 
Laplace prior specification. Second, Section]?] develops a novel approach to estimation in very 
high dimensional logistic regression. This includes details of coordinate descent for joint MAP 
estimation of coefficients and their unknown variance, conditions for global convergence, and 
an outline of estimator properties from the perspective of the literature on non-concave likeh- 
hood penalization. As background. Section ]2] briefly surveys the literature on text mining and 
sentiment analysis, and on dimension reduction and inverse regression. 

The following section describes language pre-processing and introduces two datasets that 
are used throughout to motivate and illustrate our methods. Performance comparison and de- 
tailed results for these examples are then presented in Section ]5| Both example datasets, along 
with all implemented methodology, are available in the textir package for R. 
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1.1 Data processing and examples 



Text is usually initially cleaned according to some standard information retrieval criteria, and 



we refer the reader to Jurafsky and Martin (2009) for an overview. In this article, we simply 
remove a limited set of stop words (e.g., and or but) and punctuation, convert to lowercase. 



and strip suffixes from roots according to the Porter stemmer (Porter, 1980). The main data 
preparation step is then to parse clean text into informative language tokens; as mentioned in 
the introduction, counts for these tokens are the starting point for statistical analysis. Most 



commonly (see, e.g., [Srivastava and Sahami[ |2009[ ) the tokens are just words, such that each 
document is treated as a vector of word-counts. This is referred to as the bag- of -words rep- 
resentation, since these counts are summary statistics for language generated by exchangeable 
draws from a multinomial 'bag' of word options. 

Despite its apparent limitations, the token-count framework can be made quite flexible 
through more sophisticated tokenization. For example, in the N-gram language model words 
are drawn from a Markov chain of order (see, e.g., Jurafsky and Martin] 2009[ ). A document 
is then summarized by its length- word sequences, or A^-gram tokens, as these are sufficient 
for the underlying Markov transition probabilities. Our general practice is to count common 
unigram, bigram, and trigram tokens (i.e., words and 2-3 word phrases). Another powerful 
technique is to use domain- specific knowledge to parse for phrases that are meaningful in the 



context of a specific field. Talley and O'Kane (2011 ) present one such approach for tokeniza- 
tion of legal agreements; for example, they use any conjugation of the word act in proximity of 
God to identify a common Act of God class of carve-out provisions. Finally, work such as that 
of Poon and Domingos ( 2009| ) seeks to parse language according to semantic equivalence. 

Thus while we focus on token-count data, different language models are able to influence 
analysis through tokenization rules. And although separation of parsing from statistical mod- 
eling limits our ability to quantify uncertainty, it has the appealing effect of allowing text data 
from various sources and formats to all be analyzed within a multinomial likelihood framework. 



1.1.1 Ideology in political speeches 



This example originally appears in Gentzkow and Shapiro| (GS; |2010 ) and considers text of the 
109"^ (2005-2006) Congressional Record. For each of the 529 members of the United States 



House and Senate, GS record usage of phrases in a list of 1000 bigrams and trigrams. Each 
document corresponds to a single person. The sentiment of interest is political partisanship, 
where party affiliation (Republican, Democrat, or Independent) provides a simple indicator 
and a higher-fidelity measure is calculated as the two-party vote-share from each speaker's 
constituency (congressional district for representatives; state for senators) obtained by George 
W. Bush in the 2004 presidential election. Note that token vocabulary in this example is influ- 
enced by sentiment: GS built contingency tables for bigram and trigram usage by party, and 
kept the top 1000 'most partisan' phrases according to ranking of their Pearson x^-test statistic. 

Define phrase frequency lift for a given group as fjc/fj^ where fjc is mean frequency for 
phrase j in group G and fj = Yl^=i fij/^ the average across all documents. The following 
tables show top-five lift phrases used at least once by each party. 

Democratic Frequency Lift Republican Frequency Lift 



congressional. hispanic.caucu 
medicaid. cut 
clean. drinking.water 
earth. day 
tax.cut.benefit 

1.1.2 On-line restaurant reviews 



2.163 
2.154 
2.154 
2.152 
2.149 



ayman.al.zawahiri 
america.blood.cent 
million. budget. request 
million. illegal. alien 
temporary, worker.prog ram 



1.850 
1.849 
1.847 
1.846 
1.845 



This dataset, which originally appears in the topic analysis of |Maua and Cozrnin (2009), con- 
tains 6260 user-submitted restaurant reviews (90 word average) from www . weSthere . com. 
The reviews are accompanied by a five-star rating on four specific aspects of quality - food, 
service, value, and atmosphere - as well as the overall experience. After tokenizing the text 
into bigrams (based on a belief that modifiers such as very or small would be useful here), we 
discard phrases that appear in less than ten reviews and documents which do not use any of the 
remaining phrases. This leaves a dataset of 6147 review counts for a token vocabulary of 2978 
bigrams. Top-five lift phrases that occur at least once in both positive (overall experience > 3) 
and negative (overall experience < 3) reviews are below. 

Negative Frequency Lift Positive Frequency Lift 



food poison 


5.402 


worth trip 


1.393 


food terribi 


5.354 


everi week 


1.390 


one worst 


5.339 


melt mouth 


1.389 


spoke manag 


5.318 


alway go 


1.389 


after left 


5.285 


one week 


1.389 
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2 Background 



This section briefly reviews the relevant literatures on sentiment analysis and inverse regression. 
Additional background is in the appendices and material specific to estimation is in Section |4j 



2.1 Analysis of sentiment in text 

As already outlined, we use sentiment to refer to any variables related to document composition. 
Although broader than its common 'opinion polarity' usage, this definition as 'sensible quality' 
fits our need to refer to the variety of quantities that may be correlated with text. 

Much of existing work on sentiment analysis uses word frequencies as predictors in generic 
regression and classification algorithms, including support vector machines, principle compo- 
nents (PC) regression, neural networks, and penalized least-squares. Examples from this ma- 



chine learning literature can be found in the survey by Pang and Lee (2008 1 and in the collection 



from Srivastava and Sahami (2009 ). In the social sciences, research on ideology in political text 



includes both generic classifiers (e.g., Yu et al.[ 2008 1 and analysis of contingency tables for 



individual terms (e.g., Laver et al. , 2003 ) (machine learning researchers, such as Thomas et al. 



2006 have also made contributions in this area). In economics, particularly finance, it is more 
common to rely on weighted counts for pre-defined lists of terms with positive or negative tone; 
examples of this approach include [Tetlock| ( [20071 ) and |Loughran and McDonald] ( |201 \) (again. 



machine learners such as BoUen et al. , 201 1 have also studied prediction for finance). 

These approaches all have drawbacks: generic regression does nothing to leverage the par- 
ticulars of text data, independent analysis of many contingency tables leads to multiple-testing 
issues, and pre-defined word lists are subjective and unreliable. A more promising strategy is 
to use text-specific dimension reduction based upon the multinomial implied by exchangeabil- 
ity of token-counts. For example, a topic model treats documents as drawn from a multino- 
mial distribution with probabilities arising as a weighted combination of 'topic' factors. Thus 
Xj ~ MN(ci;ii^i + . . . + ujiK0K, rrii), where topics Ok = [9ki ■ ■ ■ dkp]' and weights uji are prob- 
ability vectors. This framework, also known as latent Dirichlet allocation (LDA), has been 



widely used in text analysis since its introduction by Blei et al. (2003 1. 

The low dimensional topic-weight representation (i.e., ujj) serves as a basis for sentiment 
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analysis in the original Blei et al. article, and has been used in this way by many since. The 



approach is especially popular in political science, where work such as that of Grimmer (20101 



and Quinn et al. ( 2010| ) investigates political interpretation of latent topics (these authors re- 



strict Uik G {0, 1} such that each document is drawn from a single topic). Recently, |Blei and 



McAuliffe ( 2007[ ) have introduced supervised LDA (sLDA) for joint modeling of text and sen- 



timent. In particular, they augment topic model with a forward regression yi = f{(^i), such 
that token counts and sentiment are connected through shared topic-weight factors. 

Finally, our investigation was originally motivated by a desire to build a model-based ver- 



sion of the specific slant indices proposed by Gentzkow and Shapiro (2010), which are part of 



a general political science literature on quantifying partisanship through weighted-term indices 
(e.g., Laver et aL] 2003 1 ). Appendix A.l shows that the GS indices can be written as summation 



of phrase frequencies loaded by their correlation with measured partisanship (e.g., vote-share), 
such that slant is equivalent to first-order partial least-squares (PLS; |Woldl|1975| ). 



2.2 Inverse regression and sufficient reduction 

This article is based on a notion that, given the high dimension of text data, it is not possible to 
efficiently estimate conditional response y\x without finding a way to simplify x. The same idea 
motivates many of the techniques surveyed above, including LDA and sLDA, PLS/slant, and 
PC regression. A framework to unify techniques for dimension reduction in regression can be 
found in Cook's 2007 overview of inverse regression, wherein inference about the multivariate 
conditional distribution x|y is used to build low dimension summaries for x. 

Suppose that Vj is a i^-vector of response factors through which Xj depends on yi (i.e., Vj 
is a possibly random function of yi). Then Cook's linear IR formulation has Xj = $Vj + ej, 
where $ = [c^^ ■ ■ ■ cpj^] is ap x K matrix of inverse regression coefficients and is p- vector 
of error terms. Under certain conditions on var(ej), detailed by Cook, the projection Zj = $'xj 
provides a sufficient reduction (SR) such that yi is independent of x^ given Zj. As this implies 
p(xj|$'xj, yi) = p(xj|$'xi), SR corresponds to the classical definition of sufficiency for 'data' 
Xj and 'parameter' yi, but is conditional on unknown $ that must be estimated in practice. 
When such estimation is feasible, the reduction of dimension from p to K should make these 
SR projections easier to work with than the original predictors. 
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Many approaches to dimension reduction can be understood in context of this Unear IR 
model: PC directions arise as SR projections for the maximum likeUhood solution when Vj is 



unspecified (see, e.g., Cook 2007 1 and, following our discussion in A. 1, the first PLS direction 



is the SR projection for least-squares fit when Vj = A closely related framework is that of 
factor analysis, wherein one seeks to estimate Vj directly rather than project Xj into its lower 
dimensional space. By augmenting estimation with a forward model for ?/j|vj researchers are 



able to build supervised factor models; see, e.g., |West| ( |2003| ). 

The innovation of inverse regression, from Cook's 2007 paper and in earlier work including 
[Li1 ( |1991[ ) and |Bura and Cook] ( |2001| ), is to investigate the SR projections that result from ex- 
plicit specification for Vj as a function of Ui. Cook's principle fitted components are derived for 
a variety of functional expansions of yt, Li et al. ( 2007| ) interprets PLS within an IR framework. 



and the sliced inverse regression of Li (1991 1 defines Vj as a step-function expansion of Ui. 
Since in each case the Vj are conditioned upon, these IR models are more restrictive than the 
random joint forward-inverse specification of supervised factor models. But if the IR model 
assumptions are satisfied then its parsimony should lead to more efficient inference. 

Instead of a linear equation, dimension reduction for text data is based on multinomial 
models. Following the topic model factor specification, LDA is akin to PC analysis for multi- 
nomials and sLDA is the corresponding supervised factor model. However, existing work on 



non-Gaussian inverse regression relies on conditional independence; for example. Cook and Li 



(2009) use single-parameter exponential families to model each Xij\vi. To our knowledge, no 



one has investigated SR projections based on the multinomial predictor distributions that arise 
naturally for text data. Hence, we seek to build a multinomial inverse regression framework. 



3 Modeling 



The subject-specific multinomial inverse regression model has, for i = 1, . . . , n: 



Xi ~ MN(qi, nii) with g^^ 



j = 1, . . . ,p, where r]ij = aj + Uij + v[(pj. (5) 



This generalizes ([T]) with the introduction of f^-dimensional response factors Vj and subject 
effects Uj = [uii ■ ■ ■ Uip]'. Section|3]l derives sufficient reduction results for projections ZiUii = 



$'xi, where = [^p^, ■ ■ ■ c^^]. Section [3|2 then describes application of these results in text 
analysis and outlines situations where ([5]) can be replaced with a collapsed model as in ([T]). 
Finally, |3]3 presents prior specification for these very high dimensional regressions. 

3.1 Sufficient reduction in multinomial inverse regression 

This section establishes classical sufficiency-for-y (conditional on IR parameters) for projec- 
tions derived from the model in ([5]). The main result follows, due to use of a logit link on 77^ = 
Vlii ■ ■ ■ ^jp]'' from factorization of the multinomial's natural exponential family parametrization. 

Proposition 3.1 . Under the model in (|5]), conditional on rrii and 

l/i _LL Xj I Vi ^ ?/j _LL Xj I $'xj. 

Proof. Setting aij = aj + Uij and suppressing i, the likelihood is (^) exp [k't] — A{ri)] = 
(™)e'''" exp [(x'$)v — ^(77)] = /i(x)5(($'x, v), where A{r]) = m log J2'j=i ^"^^ ■ Hence, the 



usual sufficiency factorization (e.g., Schervish 1995, 2.21) implies p(x|$'x, v) = p(x|$'x). 



and V is independent of x given $'x. Finally, p(|/|x, $'x) = J^p(?/|v)(iP(v|#'x) = p(y|$'x). 

Second, it is standard in text analysis to control for document size by regressing yi onto fre- 
quencies rather than counts. Fortunately, our sufficient reductions survive this transformation. 

Proposition 3.2. Ifi/i _LL Xj | $'xj, mi and p{y \ Xj) = p(?/j | fj), then _LL Xj | Zj = $'fj. 

Proof. We have that each of f and [$'f , m] are sufficient for y in p(x|?/) = MN(q, m)p(m|y). 
Under conditions of |Lehmann and Sheffe| ( [T950l 6.3), there exists a minimal sufficient statistic 



T(x) and functions g and g such that g{i) = T(x) = ^($'f , m). Having g vary with m, while 
g{f) does not, implies that the map $'f has introduced such dependence. But since m cannot 
be recovered from f , this must be false. Thus g = ^($'f ), and z = $'f is sufficient for y. 

3.2 MNIR for sentiment in text: coUapsibility and random effects 

For text- sentiment response factor specification, we focus on untransformed Vi = yi and dis- 
cretized Vi = step(?/j) along with their analogues for multivariate sentiment. The former is 
appropriate for categorical sentiment (e.g., political party, or 1-5 star rating) and, for reasons 



10 



discussed below, the latter is used with continuous sentiment (e.g., vote-share is rounded to the 
nearest whole percentage, and in general one can bin and average y by quantiles). Regardless, 



our methods apply under generic v(yj) including, e.g., the expansions of Cook (2007). 

Given this setting of discrete Vj, MNIR estimation can often be based on the collapsed 
counts that arise by aggregating within factor level combinations. For example, since sums 
of multinomials with equal probabilities are also multinomial, given shared intercepts (i.e., 
Uij = 0) and writing the support of Vj as V, the likelihood for the model in ([5]) is exactly the 
same as that from, for v G V with Xv = X]i vi=v ^« ^^'^ "^v = Z]rv,=v 

Xv ~ MN(qv, mv), where q^j = =p — and rj^j = aj + \ifij. (6) 

Since pooling documents in this way leaves only as many 'observations' as there are levels in 
the support of Vj, it can lead to dramatically less expensive estimation. 

Under the marginal model of (|6]), $ is the population average effect of v on x. One needs 
to be careful in when and how estimates from this model are used in SR projection, since condi- 
tional document-level validity of these results is subject to the usual coUapsibility requirements 
for analysis of categorical data (e.g.. Bishop et al.[ 1975^ . In particular, omitted variables must 



be conditionally independent of x^ given v^; this can usually be assumed for sentiment-related 
variables (e.g., a congress person's voting record is ignored given their party and vote-share). 
Covariates that act on Xj independent of Vj should be included in MNIR, as part of the equation 
for subject effects Uj (e.g., although it is not considered in this article, it might be best to condi- 
tion on geography when regressing political speech onto partisanship). The sufficient reduction 



result of (3.1 ) is then conditional on these sentiment-independent variables, such that they (or 
their SR projection) may need to be used as inputs in forward regression. 

It is often unreasonable to assume that known factors account for all variation across docu- 
ments, and treating the Uj of ([5]) as random effects independent of Vj provides a mechanism for 
explaining such heterogeneity and understanding its effect on estimation. Omitting Uj _LL Vj 



tends to yield estimated $ that is attenuated from its correct document-specific analogue (Gail 



et al.[ 1984[ ), although the population-average estimators can be reliable in some settings; for 



example, Zeger et al. (1985) show consistency for the stationary distribution effect of covari- 
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ates when the Uj encode temporal dependence (such as that between consecutive tokens in an 
A^-gram text model). When their influence is considered negligible, it is common to simply 
ignore the random effects in estimation. In this article we also consider modeling e"'^ as in- 
dependent gamma random variables, and use this to motivate a prior in [3|3 for the marginal 
random effects in a collapsed table. Another option would be to incorporate latent topics into 
MNIR and parametrize Uj through a linear factor model; this is especially appealing since SR 
projections onto estimated factor scores could then be used in forward regression. 

This last point - on random effects and forward regression - is important: when $ is 
estimated with random effects, Section [3jl only establishes sufficiency of Zj conditional on 
Uj. Marginal sufficiency would follow from p(vj|uj, $'xj) = p(vj|$'xj), which for Uj _LL Vj 
requires Uj _LL $'xj. Thus, information about Vj from this marginal dependence is lost when 
(as is usually necessary) is omitted in regression of Vj onto Zj. Section [5] shows that random 
effects in MNIR can be beneficial even if they are then ignored in forward regression. However, 
SR projection onto parametric representations of Uj is an open research interest. 

It is clear that there are many relevant issues to consider when assessing an MNIR model, 
and it is helpful to have our sentiment regression problem placed within the well studied frame- 
work of contingency table analysis (e.g., Agresti[ 2002[ is a general reference). Ongoing work 



centers on inference according to specific dependence structures or random effect parametriza- 
tions. However, as illustrated in Section |5} even very simple MNIR models - measuring popu- 
lation average effects - allow SR projections that are powerful tools for forward prediction. 

3.3 Prior specification 

To complete the MNIR model, we provide prior distributions for the intercepts ck, loadings 
and possible random effects U = [uv^ ■ ■ ■ Uv^]', where d is the number of points in V. 

First, each phrase intercept is assigned an independent standard normal prior, aj ~ N(0, 1). 
This serves to identify the logistic multinomial model, such that there is no need to spec- 
ify a null category, and we have found it diffuse enough to accomodate category frequencies 
in a variety of text and non-text examples. Second, we propose independent Laplace pri- 
ors for each (pjk, with coefficient-specific precision (or 'penalty') parameters Xjk, such that 
Tc{(Pjk) = exp(— Ajfc|v9j,fc|) for j = 1 . . .p and k = 1 . . . K. The implied prior stan- 
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dard deviation for (^jk is \/2/\jk. Each \jk is then assigned a conjugate gamma hyperprior 
Ga(Ajfc; s, r) = r'^ /T(s)\j'f^^e^^^^'', yielding the joint gamma-Laplace prior introduced in (Sj). 
Hyperprior shape, s, and rate, r, imply expectation s/r and variance s/r^ for each Xj^. 

As an example specification, consider variation in empirical token probabilities by level 
of the logical variables 'party = republican' for congressional speech and 'rating > 3' for 
weSthere reviews. Standard deviation of finite log(gtrue,j/^faise,j) across tokens is 1.9 and 1.4 
respectively, and given variables normalized to have var(f ) = 1 these deviations in log-odds 
correspond to a jump of two in v (from approximately -1 to 1). Hence, a coefficient standard 
deviation of around 0.7, implying E[Ajjt] = 2, is at the conservative (heavy penalization) end 
of the range indicated by informal data exploration, recommending the exponential Ga(l, 1/2) 
as a penalty prior specification. In Section [5] we also consider shapes of 1/10 and 1/100, thus 
decreasing E[Ajfc] by two orders of magnitude, and find performance robust to these changes. 

The above models have, with s < 1, hyperprior densities for Lpj^ that are increasing as the 
penalty approaches zero (i.e., at MLE estimation). This strategy has performed well in many 
applications, both for text analysis and otherwise, when dimension is not much larger than 10'^. 
However, in examples with vocabulary sizes reaching 10^ and higher, it is useful to increase 
both shape and rate for fast convergence and to keep the number of non-zero term loadings 
manageably small. As an informal practical recipe, if estimated $ is less sparse than desired 
and you suspect overfit, increase s. Following the discussion in|4j3 on hyperprior variance and 
algorithm convergence, if the optimization is taking too long or getting stuck in a minor mode, 
multiply both s and r by a constant to keep E[Ajjt] unchanged while decreasing var[Ajfc]- 

Finally, we use exp[Mjj] ~ Ga(l, 1) independent for each i and j as an illustrative random 
effect model. Considering e"'^ as a multiplier on relative odds, its mode at zero assumes some 
tokens are inappropriate for a given document, the mean of one centers the model on a shared 
intercept, and the fat right tail allows for occasional large counts of otherwise rare tokens. 
Counts are not immediately collapsable in the presence of random effects, but assumptions on 
the generating process for Xj unconditional on can be used to build a prior model for their 
effect on aggregated counts: if each Xij is drawn independent from a Poisson Po(e°'J"'""^J'^'^*'^^) 
with exp[Mjj] ~ Ga(l, 1), and = J2i '^Wi=v], then x^j ~ Po(e"^+"^^+'^'^j ) with exp[Mvj] *~ 
Ga(nv, 1). For convenience, we use a log-Normal approximation to the gamma and specify 
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■Uvj ~ N(log(riv) — 0.5(7^, (7^) wither^ = log(nv + l) — log(riv)- Note that cr^ — )■ as nv grows, 
leading to static u^j whose effect is equivalent to multiplying both numerator and denominator 
of ewlVv.j]/ J2i ^^P[Vv,i] by a constant. Thus modeling random effects is unnecessary under 
our assumed model after aggregating large numbers of observations. 



3. 3. 1 Motivation for independent gamma-Laplace priors 



One unique aspect of this article's approach is the use of independent gamma-Laplace priors 
for each regression coefficient y^j^. Part of the specification should not be surprising: the 
Laplace provides, as a scale-mixture of normal densities, a widely used robust alternative to 



the conjugate normal prior (e.g., Carlin et al. 1992). It also encourages sparsity in $ through 
a sharp density spike at Lpjk = 0, and MAP inference with fixed \jk is equivalent to likelihood 



maximization under an Li -penalty in the lasso estimation and selection procedure of Tibshrani 



( 1996). Similarly, conjugate gamma hyperpriors are a common choice in Bayesian inference 
for lasso regression (e.g., [Park and CaseIIal|2008[ ). 

However, our use of independent precision for each coefficient, rather than a single shared 
A, is a departure from standard practice. We feel that this provides a better representation of 
prior utihty, and it avoids the overpenalization that can occur when inferring a single coefficient 
precision on data with a large proportion of spurious regressors. In their recent work on the 



Horseshoe prior, Carvalho et al. (2010) illustrate general practical and theoretical advantages of 
an independent parameter variance specification. As detailed in Section]?} our model also yields 
an estimation procedure, labeled the gamma-lasso, that corresponds to likelihood maximization 
under a specific nonconcave penalty; the estimators thus inherit properties deemed desirable by 



authors in that literature (beginning from Fan and Li 2001 ) 



Finally, given the common reliance on cross-validation (CV) for lasso penalty selection, it 
is worth discussing why we choose to do otherwise. First, our independent \jk penalties would 
require a CV search of impossibly massive dimension. Moreover, CV is just an estimation 
technique and, like any other, is sensitive to the data sample on which it is applied. As an 
illustration. Section |5| 1 contains an example of CV-selected penalty performing far worse in 
out-of- sample prediction than those inferred under a wide range of gamma hyperpriors. CV is 
also not scaleable: repeated training and validation is infeasible on truly large applications (i.e.. 
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when estimating the model once is expensive). That said, one may wish to use CV to choose 
s or r in the hyperprior; since results are less sensitive to these parameters than they are to a 
fixed penalty, a small grid of search locations should suffice. 



4 Estimation 

Following our model specification in Section [3| the full posterior distribution of interest is 

n p K 

p($, a, A, U I X, V) oc n n 0, al) J] GL(y;,fc, A.^) (7) 

i=l 3=0 k=l 

where qij = exp[r7y]/ ^ILi ^^PiVu] with r]ij = aj + Uij + Ylk=i '"ikVjk and GL is our gamma- 
Laplace joint coefficient-penalty prior Laplace((/9jfc; \jk)Qa.{\jk] r, s). We only consider here 



u 



or Uij *~ N(0, af) for TT{uij), although sentiment-independent covariates can also be 



^3 



included trivially as additional dimensions of Vj. Note that 'i' denotes an observation, but that 
in MNIR this will often be a combination of documents after the aggregation of Section [3}2. 

Bayesian analysis of logistic regression typically involves posterior simulation, e.g. through 
Gibbs sampling with latent variables ( Holmes and Held] 2006) or Metropolis sampling with 



posterior-approximating proposals (Rossi et al. , 2005[ ). Despite recent work on larger datasets 



and sparse signals (e.g., |Gramacy and Polson[[2012| ), our experience is that these methods are 



too slow for text analysis applications. Even the more modest goal of posterior maximiza- 
tion presents considerable difficulty: unlike the usual high-dimension logistic regression exam- 
ples, where K is big and p is small, our large response leads to a likelihood that is expensive 
to evaluate (due to normalization of each q^) and has a dense information matrix (from|4j2, 
logLBD/dipjk = J2^=i ^i^ik^iji^ ~ Qij)^ which will not be zero unless Vik is). As a result. 



commonly used path algorithms that solve over a grid of shared A values (e.g., Friedman et al 



2010[ as implemented in gimnet for R) do not work even for the small examples of this article. 



We are thus motivated to develop super efficient estimation for sparse logistic regression. 
The independent gamma-Laplace priors of Section |3j3 are the first crucial aspect of our ap- 
proach: it remains necessary to choose hyperprior s and r, but results are robust enough to 
misspecification that basic defaults can be applied. Section |4}l derives the gamma-lasso (GL) 
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non-concave penalty that results from MAP estimation under this prior. Second, Section |4j2 
describes a coordinate descent algorithm for fast negative log posterior minimization wherein 
the GL penalties are incorporated at no extra cost over standard lasso regression. Lastly, |4j3 
considers conditions for posterior log concavity and provides a check for global convergence. 



4.1 Gamma-lasso penalized regression 

Our estimation framework relies upon recognition that optimal Xjk can always be written as a 
function of ^pjk, and thus does not need to be explicitly solved for in the joint objective. 

Proposition 4. 1 . MAP estimation for $ and A under the independent gamma-Laplace prior 
model in ([^ is equivalent to minimization of the negative log likelihood for $ subject to costs 

p K 

c($) = '^'^c{(fjk), where c{ipjk) = slog(l + \^Pjk\/r) (8) 
j=i k=i 

Proof. Under conjugate gamma priors, the conditional posterior mode for each Xjk given cpjk 
is available as \{^jk) = s/{r + Iv^ja,])- Any joint maximizing solution A] for (jvj) will 
thus consist of Xj^ = X{(pjk)', otherwise, it is always possible to increase the posterior by 
replacing Xjk- Taking the negative log of ^ and removing constant terms, the influence of a 
GL{Xjk, ifjk) prior on the negative log posterior is — s log(Ajfc) + (r + \Lpjk\)Xjk, which becomes 
-slog [(s/r)/(l + \(Pjk\/r)] + s (X slog(l + \(Pjk\/r) after replacing Xjk with X{(pjk). 

The implied penalty function is drawn in the left panel of Figure |2j Given its shape - every- 
where concave with a sharp spike at zero - our gamma-lasso estimation fits within the general 



framework of nonconcave penalized likelihood maximization as outlined in Fan and Li (2001 1 



and studied in many papers since. In particular, c{(fjk) can be seen as a reparametrization of 



the 'log-penalty' described in Mazunder et al.|(2011[ eq. 10), which is itself introduced in 



Friedman ( 2008| ) as a generalization of the elastic net. Viewing estimation from the perspective 



of this literature is informative. Like the standard lasso, singularity at zero in c{ipjk) causes 
some coefficients to be set to zero. However, unlike the lasso, the gamma-lasso has gradient 

c'{(pjk) = sign{(p jk)s/{r + \(pjk\) which disappears as \(pjk\ — )■ oo, leading to the property of 



unbiasedness for large coefficients listed by Fan and Li (2001 ) and referred to as Bayesian ro- 
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business by |Carvalho et al.| ( |2010[ ). Other results from this literature apply directly; for example, 
in most problems it should be possible to choose s and r to satisfy requirements for the strong 
oracle property of Fan and Peng ( 2004| ) conditional on their various likelihood conditions. 

It is important to emphasize that, despite sharing properties with cost functions that are 
purpose-built to satisfy particular notions of optimality, c{(pjk) occurs simply as a consequence 
of proper priors in a principled Bayesian model specification. To illustrate the effect of this 
penalty. Figure [T] shows MAP coefficients for a simple logistic regression under changes to data 
and parameterization. In each case, gamma-lasso estimates threshold to zero before jumping 
to solution paths that converge to the MLE with increasing evidence. Figure [2] illustrates how 
these solution discontinuities arise due to concavity in the minimization objective, an issue that 
is discussed in detail in Section |4j3. Note that although the univariate lasso thresholds at larger 
values than the gamma-lasso, in practice we often observe greater sparsity under GL penalties 
since large signals are less biased and single coefficients are allowed to account for the effect 
of multiple correlated inputs. In contrast, standard lasso estimates also fix some estimates at 
zero but lead to continuous solution paths that never converge to the MLE. 



4.2 Negative log posterior minimization by coordinate descent 

Taking negative log and removing constant factors, maximization equates with minimization 
of /($, a, U) + X]j=i(o^i/'^a)^ ~ ^ogTT{\J) + c($), where / is the strictly convex 



/($,a,U) = -^ 



1=1 



$'vi + Uj) - nii log I ^ exp(aj + c^j-'v^ + Uij] 
\j=i 



(9) 



Full parameter-set moves for this problem are prohibitively expensive in high-dimension due 
to (typically dense) Hessian storage requirements. Hence, feasible algorithms make use of 
coordinate descent (CD), wherein the optimization cycles through updates for each parame- 
ter conditional on current estimates for all other parameters (e.g., [Luenberger and Ye 2008 1. 
Although conditional minima for logistic regression are not available in closed-form, one can 
bound the CD objectives with an easily solvable function and optimize that instead. In such 



bound-optimization (also known as majorization; Lange et al. 2000) for, say, l{9), each move 
6'*^^ — )■ 0^ proceeds by setting new 0*^ as the minimizing argument to bound h{6), where h is such 
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-3 -1 1 2 3 -3 -1 1 2 3 -3 -1 1 2 3 



x'v 

Figure 1: Maximizing solutions for univariate logistic regression log posteriors L{ip) = :x'\rip — 
Y,i log [1 + e"^""^] - pen{ip), given v = [-1, -1, 1, 1]'. The dotted line is the MLE, with pen((^) = 0, 
the dashed line is lasso pen((^) = s|(^|/r, and the solid line is gamma-lasso pen{ip) = s log(l + |(^|/r). 



x'v = 1 .6 x'v = 2 



o 




phi phi phi 



Figure 2: The left panel shows gamma-lasso penalty slog(l + Iv^l/r) for [s, r] of [1, 1/2] (solid) and 
[3/2, 3/4] (dashed). The right two plots show the corresponding minimization objectives, negative log 
likelihood plus GL penalty, near a solution discontinuity in the simple logistic regression of Figure [T] 




-0.15 -0.05 0.05 1.35 1.45 1.55 -0.10 0.00 0.10 

phi [ chicl^en wing ] phi [ first date ] phi [ ate here ] 



Figure 3: Coordinate objective functions at convergence in regression of weSthere reviews onto overall 
rating. Solid lines are the true negative log likelihood and dashed lines are bound functions with 6 = 0.1. 
Both are shown for new tpj as a difference over the minimum at estimated ipj (marked with a dot). 
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that previous estimate 9^ ^ minimizes b{9) — l{9). Algorithm monotonicity is then guaranteed 
through the inequality l{9') = b{9') + l{9') - h{9') < b{9'-^) - [b{9'-^) - 1(9'-^)] = 1(9'-^). 

Using 9* to denote a new value for a parameter currently estimated at 9, a quadratic bound 
for each element of (|9]) conditional on all others is available through Taylor expansion as 

= cx, U) + gi{9){9* -9) + ^{9*- 9fHe (10) 

where gi{9) = dl/ 89 is the current coordinate gradient and He is an upper bound on curvature 
at the updated estimate, hi(9*) = dH/d9*'^. Quadratic bounding is also used in the logistic 
regression CD algorithms of |Krishnapuram et al.| ( |2005l ) and |Madigan et al.| ( |2005[ ): the former 



makes use of a loose static bound on hu while the latter updates He after each iteration to 
obtain tighter bounding in a constrained trust-region {9* E 9:L5} for specified 5 > 0. We have 
found that dynamic trust region bounding can lead to an order-of-magnitude fewer iterations, 
and Appendix A. 2 derives He as the least upper bound on hi{9*) for 9* within 5 of 9. 
In implementing this approach, coordinate- wise gradient and curvature for (pjk are 

dl " dH " 

9i{'^jk) = ^ = -y^^Vikixij-niiqij) and hi{ipjk) = = y^mivjj,qij{l-qij), (11) 

and similar functions hold for random effects and intercepts but with covariates of one and with- 
out summing over i for random effects. Then under normal, say N(/i6i, a^), priors for 9 = Uij 
or Oj, the negative log posterior bound is B{9*) = h{9*) + 0.5(6' — jieY /cil which is minimized 
in {9±5}^i9* = 9 - sgn(A0)min{|A0|, 5} with = [gi{9) + {9- ^ie)/<yl] / [He + l/fx^]. 

Although the GL penalty on ipjk is concave and lacks a derivative at zero, coordinate-wise 
updates are still available in closed form. Suppressing the jk subscript, each coefficient update 
under GL penalty requires minimization of B^ip*) = gi(ip)((f* — ip) + ^((p* — H^ + slog{l + 
\(p*\/r) within the trust region {if* E ip ±5 : sgn(y9*) = sgn((p)}. This is achieved by finding 
the roots of B'((p*) = and, when necessary, comparing to the bound evaluated at zero where 
B' is undefined. Setting B'((p*) = yields the quadratic equation 

+ (sgn(v9)r - + — - sgn{ip)r(p = (12) 
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with characteristic {sgn{(f)r + (p)^ — As/ H^, where ip = — gi{(f)/H^ would be the updated 
coordinate for an MLE estimator. From standard techniques, for {^p* : sgn((y9) = sgn((y9*)} 
this function will have at most one real minimizing root - that is, with > s/ (r + Iv^*])^- 
Hence, each coordinate update is to find this root (if it exists) and compare B^ip*) to -B(O). The 
minimizing value (0 or possible root ip*) dictates our parameter move A(/9, and this move is 
truncated at sgn(A(y9)5 if it exceeds the trust region. Finally, when (y9 = 0, repeat this procedure 
for both sgn(y9) = ±1; at most one direction will lead to a nonzero solution. 

As it is inexpensive to characterize roots for B'{(p*), the gamma-lasso does not lead to any 
noticeable increase in computation time over standard lasso algorithms (e.g., [Madigan et al. 
2005 p . Crucially, tests for decreased objective can performed on the bound function, instead 
of the full negative log posterior. Figure [3] shows objective and bound functions around the 
converged solution for three phrase loadings from regression of weSthere reviews onto overall 
rating. With 5 = 0.1, B provides tight bounding throughout this neighborhood. Behavior 
around the origin is most interesting: the solution for chicken wing, a low-loading negative 
term, is at B'(lp*) = just left of the singularity at zero, while ate here falls in the sharp point 
at zero. The neighborhood around first date, a high-loading term, is everywhere smooth. 

4.3 Posterior log concavity and algorithm convergence 

Since the gamma-lasso penalty is everywhere concave, our minimization objective is not guar- 
anteed to be convex. This is illustrated by the right two plots of Figure [2} where a very low- 
information likelihood (four observations) can be combined with a relatively diffuse prior on A 
(s = 1, r = 1/2) to yield concavity near zero. The effect of this is benign when the gradient 
is the same direction on either side of the origin (as in the right panel of |2]), but in other cases 
it will lead to local minima at zero away from the true global solution (as in the center panel). 
Such non-convexity is the cause of the discontinuities in the solution paths of Figure [T] 

From the second derivative of l{'Pjk) + c{ipjk), the conditional objective for ipji^ will be 
concave only if hi((pjk = 0) < s/r^ - that is, if prior variance on \jk is greater than the 
negative log likelihood curvature at Lpjk = 0. In our experience, this problem is rare: the 
likelihood typically overwhelms penalty concavity and real examples behave like those shown 
in Figure |3j Moreover, although it is possible to show stationary limit points for CD on such 
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nonconvex functions (e.g. |Mazunder et aL||201 we advocate avoiding the issue through prior 



specification. In particular, hyperprior shape and rate can be raised to decrease var(Ajfc) while 
keeping E[Ajfc] unchanged. Although this may require more prior information than desired, it 
is the amount necessary to have both fast MAP estimation and estimator stability. If you want 
to use more diffuse priors, you should pay the computational price of marginalization and mean 
inference (as in, e.g., Gramacy and Poison, 2012[ ). 



Even convexity in the coordinate updates is no guarantee of full objective convexity. How- 
ever, we close this section by showing that the joint problem of optimizing both A and $ is 
convex and has a single global minimum. Hence, we can derive gradient conditions on this 
expanded objective, and one is always able to check the estimation for global convergence. 



Proposition 4.2. [$,a,U] estimated following Sectional will correspond to the global 

MAP o/0 if and only ifG{ipjk) = sgn{ipjk)s/ (r + \ipjk\) - X]r=i Vik{xij - niiqij) is zero for 
(fjk 7^ 0, and is negative and positive in its left and right limits respectively around ipjk = 0. 

Proof. Since the objective for [q:,U] given $ is strictly convex, these will always be global 
conditional solutions. We can thus focus on the negative log conditional posterior for A], 
written /($) — J2^j=i J2k=i -^logl-^ifc) + + \Vjk\)^jk- With Xj^ > 0, the first two terms are 
convex in $ and A, respectively, and the third term is jointly convex (but not strictly so) in $ 
and A, such that this function has a single minimum with each component at either the origin or 
a point of zero gradient. Taking derivatives and replacing Xj^ = s/{r+ \(pjk\) yields G{(pjk). 

This simple result removes any uncertainty about global convergence, a standard issue with 
nonconcave penalization routines. Our fitted examples of Section B] all satisfy the test in (4.2 1. 



5 Examples 

We now apply our framework to the datasets of Section [T| 1 . The implemented software is 
available as the textir package for R, with these examples included as demos. Section [5] 1 
examines out-of-sample predictive performance, and is followed by individual data analyses. 



21 



5.1 A comparison of text regression methods 

Our prediction performance study considers three text analyses: both constituent percentage 
vote-share for G.W. Bush (bushvote) and Republican party membership (gop) regressed onto 
speech for a member of the 109*'^ US congress, and a user's overall rating (overall) regressed 
onto the content of their weSthere restaurant review. In each case, we report root mean square 
error or misclassification rate over 100 training and validation iterations. Full results and study 
details are provided in Appendix A. 3, and performance for a subset of models is plotted in 
Figure |4j Here, we focus on some main comparisons that can be drawn from the study. 

MNIR is considered under three different hyperprior specifications, with rate r = 1/2 and 
shapes of s = 1/100, 1/10, and 1. Response factors are Vi = yi for gop and overall, and Vi is 
set as Ui rounded by whole number for bushvote (note that instead setting Vi = here leads 
to no discernable improvement). In each case, MNIR is fit for observations binned by factor 
level. We consider models both with and without independent random effects. As predicted, 
performance is unaffected by random effects for discrete y-i, where we are collapsing together 
hundreds of observations. However, they do improve out-of-sample performance by approxi- 
mately 1.5% for bushvote, where only a small number of speakers are binned at each whole 
percentage point. Hence, detailed MNIR results are reported with random effects included only 
for bushvote. Finally, resulting SR scores Zi = (p'fj are incorporated into a variety of forward 
regression models: linear E[yj] = a + (3zi and quadratic E[?/j] = a + (3iZi + (32zf for bushvote, 
logistic E[yi] = exp[a; + (3zi\/{\ + exp[a + (3zi\) for gop, and linear and proportional-odds 
logistic p(|/j < c) = exp[ac — (5z.i\/ (1 + exp[a;c — c = 1 ... 5, for overall. 

Performance is very robust to changes in the MNIR hyperprior. Figure |4] shows little differ- 
ence between otherwise equivalent models using the conservative default s = 1 and the lowest 
expected penalty s = 1/100; results for s = 1/10 are squeezed in-between. In congressional 
speech examples s = 1/100 has a slight edge; phrases here have already been pre-selected 
for partisanship and are thus largely relevant to the sentiment. On the other hand, s = 1 is 
the best performing shape for the weSthere example, where phrases were only filtered by a 
minimum document threshold. Looking at forward regressions, the problem specific quadratic 
bushvote (see Section[5j2 for justification) and proportional odds overall (accounting for ordinal 
response) forward regressions provide lower average out-of-sample error rates at the price of 
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slightly higher variability across iterations, when compared to simple linear forward regression. 

As comparators, we consider text-specific LDA (both supervised and standard topic mod- 
els) as well as an assortment of generic regression techniques: lasso penalized linear (bushvote 
and overall) and binary logistic (gop) regression, with penalty either optimized under our 
gamma hyperpriors (gop), marginalized in MCMC (bushvote), or tuned through CV (all ex- 
amples); first-direction PLS (bushvote and overall); and support vector machines (gop). In 
every comparison, gamma-lasso MNIR provides higher quality predictions with lower run- 
times. The only similar predictive performance was for LDA with 25 and 50 topics in the 
bushvote example, at 15-50 times higher computational cost. Note that, given the size of real 
text analysis applications, we view the speed and scaleability of MNIR as a primary strength 
and only considered feasible alternatives, with short Gibbs runs for 50 topic sLDA and the 
Bayesian lasso (7-9 min) at the very high end of our runtimes. Moreover, both sLDA and CV 
lasso occasionally fail to converge (these runs were excluded); this never happened for MNIR. 

Among comparators, the multinomial topic models outperform generic alternatives. Inter- 
estingly, LDA combined with simple regression outperforms sLDA in both congress examples. 
Again, this is probably due to pre-selection of phrases: topics are relevant to ideology regard- 
less of supervision, and the extra parameters in sLDA are not worth their cost in degrees of 
freedom. Moreover, the simpler LDA models can be fit with the MAP estimation of Taddy| 



(2012b), whereas sLDA is applied here through a slow-to-converge Gibbs sampler (we note 
that the original sLDA paper uses a variational EM algorithm). However, in the weSthere data, 
the extra machinery of sLDA offers a clear improvement over unsupervised LDA, as should 
be the case in many text applications. Finally, in an important side comparison, binary logistic 
regressions were fit for gop regressed onto phrase frequencies using both CV and independent 
gamma hyperpriors for the lasso penalty. The scaleable, low-cost, gamma-lasso yields large 
performance improvements over a CV optimized model, regardless of hyperprior specification. 

5.2 Application: partisanship and ideology in political speeches 

For the data of Section [1] 1 . 1 , we have two sentiment metrics of interest: an indicator for party 
membership, and each speaker's constituent vote-share for Bush in 2004. Since the two inde- 
pendents caucused with Democrats, the former metric can be summarized in gop as a two-party 
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Figure 4: Out-of-sample performance and run-times for select models. For MNIR, 'Q' indicates 
quadratic and 'po' proportional-odds logistic forward regressions, while Xj prior '1' is Ga(0.01,0.5) 
and '3' is Ga(l, 0.5). We annotate with the number of topics for (s)LDA, and for binary Lasso regres- 
sions with either CV or the rate in an exponential penalty prior. Full details are in the appendix. 
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partisanship. Following the political economy notion that there should be little discrepancy be- 
tween voter and representative beliefs, bushvote provides a measure of ideology as expressed 
in support for G.W. Bush (and lack of support for John Kerry) in the context of that election. 

Figure |5] shows MNIR fit in separate models for each of gop and bushvote, as studied 
in Section [5] 1. For partisanship, fit with s = 1/100 and r = 1/2, a simple univariate lo- 
gistic forward regression yields clear discrimination between parties; 8.5% (45 speakers) are 
misclassified under a maximum probability rule. In the bushvote MNIR, fit under the same 
hyperprior but with inclusion of random effects, the resulting SR scores Zi = ip'ii increase 
quickly with vote-share at low (mostly Democrat) values and more slowly for high (mostly Re- 
publican) values. This motivates our quadratic forward regression for bushvote onto SR score, 
the predictive mean of which is plotted in Figure [s] (with i?^ of 0.5). However, looking at the 
SR scores colored by party (red for Republicans, blue Democrats, green independents) shows 
that this curvature could instead be explained through different forward regression slopes by 
level of gop, implying that the relationship between language and ideology is party-dependent. 

Given the above, a more useful model might consider text reduction that allows interaction 
between party and ideology. For example, we can build orthogonal bivariate sentiment factors 
as gop and bushvote minus the gop-level means, say votediff (again, rounded to the nearest 
whole percentage). Figure [6] shows fitted values for such a model, including random effects 
and with hyperprior shape increased to s = 1/10 to reflect a preference for smaller conditional 
coefficients. In detail, with Zgop and -Zvotediff the two dimensions of SR scores from MNIR 
X ~ MN(q(t'gop, f votediff ), "^), normalized for ease of interpretation, the fitted forward model is 

E[bushvote] = 51.9 + Q.2zgop + 5.22:votediff - l-Q^op-Zvotediff- (13) 

Thus a standard deviation increase in either SR direction implies a 5-6% increase in expected 
vote-share, and each effect is dampened when the normalized SR scores have the same sign. 

The right panel of Figure [6] shows fitted expected counts qjui against true nonzero counts 
in our bivariate MNIR model fit; with random effects to account for model misspecification, 
there appears to be no pattern of overdispersion. The only clear outlier in forward regression 
is Chaka Fattah (D-PA) with a standardized residual of -5.2; he uttered a token in our sample 
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Figure 5: Separate MNIR fits for congressional speecii onto eacii of party and vote-sliare. Tiie right 
sliows probabilities that each speaker is Republican and the left shows SR scores against bushvote. 
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Figure 6: Bivariate ideology and partisanship MNIR. The left plot shows fitted values for a forward 
regression that interacts SR scores, and the right shows fitted vs observed token counts in MNIR. 
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Select congressional speech term loadings in bivariate MNIR with party and vote-share. 
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only twice: once each for rate. return and billion. dollar. Finally, Figure |7] plots response factor 
loadings for a select group of tokens. Among other lessons, we see that racial identity rhetoric 
(african.american. latino, black.caucu) points towards the left wing of the Democratic party, 
while discussion of hate crimes is indicative of a moderate Republican. A few large loadings are 
driven by single observations: for example, violent. sexual. predator contributes more than 0.1% 
of speech for only Byron Dorgan, a Democratic Senator in Bush-supporting North Dakota. 
However, this is not the rule and most term loadings affect many speakers. 

5.3 Application: on-line restaurant reviews 

For the data of Section [T] 1.2, our sentiment consists of five correlated restaurant ratings (each 
on a five point scale) that accompany every review. The left panel of Figure [8] shows MNIR 
for review content regressed onto the single overall response factor, as studied in Section |5]1. 
The true overall rating has high correlation (0.7) with our SR scores, despite considerable over- 
lap between scores across rating levels. The right plot of Figure [8] shows probabilities for each 
increasing overall rating category, as estimated in the proportional-odds logistic forward regres- 
sion, p(overall < c) = exp[ac - /32;overaii]/(l + exp[ac - (3zoyera\\])- Again, ^overall is normalized 
here to have mean zero and standard deviation of one in our sample. This model has (3 = 2.3, 
implying that the odds of being at or above any given rating level are multiplied by e^-^ ^ 10 
for every standard deviation increase in the SR score. 

Looking to explore aspect- specific factors. Figure |9] shows top-30 absolute value loadings 
in MNIR for review token-counts onto all five dimensions of sentiment. Influential terms on 
either side of the rating spectrum can be easily connected with elements of a good or bad meal: 
plan. return, best. meal, and big. portion are good, while sent. back, servic.terribi, and food. bland 
are bad. The largest loadings appear to be onto overall and food aspects, with service slightly 
less important and loadings for value and atmosphere quickly decreasing in size. This would 
indicate that the reviews focus on these elements in that order. 
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Figure 8: Sufficient reduction and forward model fit for inverse regression of weSthere reviews onto the 
corresponding overall rating. The left plot shows SR score by true review rating, and the right shows 
proportional-odds logistic regression probabilities for each rating-level as a function of these SR scores. 
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Figure 9: High-loading phrases in each direction for regression of weSthere reviews onto aspect ratings. 
Green tokens are positive, black are negative, and size is proportional to the absolute value of the loading. 
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6 Discussion 



The promising results of Section [5] reinforce a basic idea: a workable inverse specification can 
introduce information that leads to more efficient estimation. Given the multinomial model as a 
natural inverse distribution for token-counts, analysis of sentiment in text presents an ideal set- 
ting for inverse regression. While the approach of not jointly modeling a corresponding forward 
regression falls short of full Bayesian analysis, such inference would significantly complicate 
estimation and detract from our goal of providing a fast default method for supervised docu- 
ment reduction. We are happy to take advantage of parametric hierarchical Bayesian inference 
for the difficult MNIR estimation problem, and suggest that application appropriate techniques 
for low-dimensional forward regression should be readily available. 

Although the illustrative applications in this article are quite simple, the methods scale to 
far larger datasets. Collapsing observations across sentiment factors for MNIR yields massive 
computational gains: training data need only include token counts tabled by sentiment level, 
and as an example, in |Taddy ( |2012a ) this allows MNIR runs of only a few seconds for 1.6 mil- 
Uon twitter posts scored as positive or negative. Moreover, we see no reason why gamma-lasso 
logistic regression, which was developed specifically for large response settings, should not be 
viewed as an efficient option in generic penalized regression. Finally, current collaborations 
that use MNIR for text analysis include study of partisanship in the US congressional record 
from 1873 to present, and an attempt to quantify the economic content of news in 20 years of 
Wall Street Journal editions. In each case, we are considering a more rigorous treatment of the 
identification of single sentiment dimensions and controlling for related endogenous variables; 
this work shows MNIR's promise as the basis for a variety of text related inference goals. 
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Appendix 



A.l Slant and Partial Least Squares 

The GS slant index for document i is zf^^^ = J^j^i ^jifij ~ %)/ Z]j=i with parameters 
obtained through ordinary least-squares (OLS) as [aj,bj] = argmina^t ^"^-^[/jj — (a+6?/j)]^ for 
j = . . .p. Since bj = cov(fj, y)/vai(y), slant is equivalent (up to a uniform shift and scale for 
all index values) to a weighted sum of term frequencies loaded by their covariance with y. This 



is also the first direction in partial least-squares; see Frank and Friedman ( 1993) for statistical 
properties of PLS and its relationship to OLS, and Hastie et al. ( 2009| ) for a common version 
of the algorithm. Using the usual normalization applied in PLS, an improved slant measure is 
given by = X]j=i /ijCor(/j , yi). For vote-share regressed onto congressional speech in 
the data of Section[T]l.l, this change increases within-sample from 0.37 to 0.57. 

Given F = [fi ■ • ■ fp] as a normalized covariate matrix with mean-zero and variance-one 
columns, a PLS algorithm which highlights its inverse regression structure is as follows. 

1. Set the initial response factor vq = y = [yi . . . and for A; = 1, . . . , i^: 

- Loadings are cp^ = cor(F, Vfc_i) = [cor(fi, v^^i) . . . cor(fp, Vfc_i)]'. 

- The k^^ PLS direction is = (p'^F. 

- The new response factors are = \'k-i — [z'^Vfe_i/(z'^.Zjt)]zfc. 

2. Set y as OLS fitted values for regression of y onto Z, where Z = [zi ■ ■ ■ zk]- 

An extra step to normalize and orthogonalize z^ with respect to [zi ■ ■ ■ Zk-i] recovers orthonor- 
mal directions, as in the original PLS algorithm. Moreover, loading calculations replaced by 
(fkj = argmin^ XlILil/^i — (a + ^Vki)]"^ will only scale z^ by the variance of and lead to 
the same forward fit, such that PLS can be viewed as stagewise inverse regression. 

A.2 Trust-region bound for logistic multinomial likelihood 



The bounding used here is essentially the same as in Genkin et al. (2007) but for introduction 

of dependence upon Vik that is missing from their version. We describe the bound for updates 

to (fjk, but it applies directly to aj or Uij simply by replacing covariate values with one. 

Given a trust region of cpjk ± S, the upper bound on hi{(pjk) = X]r=i %7e"^*5'y (1 ~ lij) 

Hjk = Yli=i ^ik'^i/^ij^ where each Fij is a lower bound on 1/ (qij - qfj) = 2 + e'^'^^^"''' / Eij + 

Eij / e^^^^^'"^^ , with Eij = J2f=i ^"^^ ~ ^^'^ ■ This target is convex in 5 with minimum at e^'"^'' = 

i?j,7e^"^ such that . , , 

e- E- ""^ 
Fij = ^ + + 2 where dj = < e'^'^+l"*''!'' if Eij > e'''^+l''"=l'^ 

Eij otherwise. 
We use unique 6jk and update 5*^^. = max{(5jfc/2, 2|(y9*^, — ^jk\} after each iteration. 
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